ABSTRACT


Classifying educational forum posts is a longstanding task 
in the research of Learning Analytics and Educational Data 
Mining. Though this task has been tackled by applying 
both traditional Machine Learning (ML) approaches (e.g., 
Logistic Regression and Random Forest) and up-to-date
Deep Learning (DL) approaches, a systematic examination
of these two types of approaches that portrays their
performance difference is lacking. To better guide researchers
and practitioners to select a model that suits their needs 
the best, this study aimed to systematically compare the 
effectiveness of these two types of approaches for this spe- 
cific task. Specifically, we selected a total of six repre- 
sentative models and explored their capabilities by equip- 
ping them with either extensive input features that were 
widely used in previous studies (traditional ML models) 
or the state-of-the-art pre-trained language model BERT 
(DL models). Through extensive experiments on two real- 
world datasets (one is open-sourced), we demonstrated that: 
(i) DL models uniformly achieved better classification results
than traditional ML models and the performance difference
ranges from 1.85% to 5.32% with respect to different
evaluation metrics; (ii) when applying traditional ML
models, different features should be explored and engineered 
to tackle different classification tasks; (iii) when applying 
DL models, it tends to be a promising approach to adapt 
BERT to the specific classification task by fine-tuning its 
model parameters. We have publicly released our code at 
https://github.com/1sha49/LL_EDU_FORUM_CLASSIFIERS

1. INTRODUCTION


In the past two decades, researchers have developed a num- 
ber of online educational systems to support learning, e.g., 
Massive Open Online Courses, Moodle, and Google Class- 
room. Though being widely recognized as a more flexible 
option compared to campus-based education, these systems 
are often limited by their asynchronous mode of delivery 
that may hinder effective interaction between instructors 
and students and between students themselves [27, 20]. As 
a remedy, the discussion forum component is often included 
to support communication between instructors and class- 
mates, so students can create posts for different purposes, 
e.g., to ask questions, express opinions, or seek technical 
help. Moreover, in certain cases, instructors rely heavily on 
the use of a discussion forum to promote peer-to-peer col- 
laboration, e.g., specifying a topic to spur discussions among 
students. 


In this context, the timeliness of an instructor’s response to a 
student post becomes critical. A group of studies has demonstrated
that students’ learning performance and course experience
were greatly affected by the timeliness of the responses
they received from instructors [2, 24, 14]. It is,
therefore, critical that instructors monitor the discussion fo- 
rum to provide timely help to students who need it and 
ensure the discussion unfolds in a way that benefits all stu- 
dents. However, nowadays, up to tens of thousands of stu- 
dents can enroll in an online course and create a variety of 
posts that differ by importance, i.e., not all of them warrant 
instructors’ immediate attention. Therefore, it becomes increasingly
challenging for instructors to promptly identify posts
that require an urgent response or to understand how well 
students collaborate in the discussion space. 


To tackle this challenge, various computational approaches 
have been developed across different courses and domains to 
classify educational forum posts, e.g., to distinguish between 
urgent and non-urgent posts [2, 24] or to label posts for dif- 
ferent levels of cognitive presence [11, 52]. Typically, these 
approaches relied upon traditional Machine Learning (ML) 
models, such as Logistic Regression, Support Vector Ma- 
chine (SVM), and Random Forest. These models yielded a 
high level of accuracy, most often due to the extensive efforts 
that domain experts made to engineer input features. For 
post classification tasks, such features are linguistic terms 
describing the post content (e.g., words that represent nega- 
tive emotions) and the post metadata (e.g., a creation times- 
tamp) [31, 38, 19]. 


In recent years, Deep Learning (DL) models have emerged 
as a powerful strand of modeling approaches to tackle data- 
intensive problems. Compared to traditional ML models, 
DL models no longer require the input of expert-engineered
features; instead, they are capable of implicitly extracting 
such features from data with a large number of computa- 
tional units (i.e., artificial neurons). Particularly, DL models 
have achieved great success in solving various Natural Lan- 
guage Processing (NLP) problems, e.g., machine translation 
[48], semantic parsing [22], and named entity recognition 
[60]. Driven by this, a few studies have been conducted and 
demonstrated the superiority of DL models over traditional 
ML models in classifying educational forum posts [24, 10, 
59]. For instance, Guo et al. [24] showed that DL models 
can outperform a decision tree based ML model proposed 
in [2] by 0.1 (measured by F1 score) in terms of identifying 
urgent posts, while [59] demonstrated that, when determining
whether a post contains a question or not, the performance
difference between SVM and DL models was up to
0.68 (measured by Accuracy). 


Though achieving high performance, DL models have not 
been justified as an always-more-preferable choice compared 
to traditional ML models. The reasons are threefold. Firstly, 
studies investigating the difference in performance between 
traditional ML and DL models have mostly harnessed a 
limited set of traditional ML models for comparison, with- 
out making extensive feature engineering efforts to empower 
those traditional ML models. As an example, [59] compared 
only SVM to a group of DL models, and the SVM model 
in this study incorporated only one type of feature, i.e.,
the term frequency-inverse document frequency (TF-IDF)
score of the words in a post. This implies that the potential 
of the traditional ML models used in existing studies was 
not fully explored and the actual performance difference
between the two types of models might be smaller than the
studies to date have reported. Secondly, researchers and
practitioners often need to deliberately trade off several rel- 
evant factors before determining which model they should 
use in practice, and classification performance is only one of 
these factors. Other important factors are the availability 
of human-annotated training data and computing resources 
[29]. For instance, compared to traditional ML models, DL 
models demand a much larger amount of human-annotated 
training data, whose creation can be a time-consuming and 
costly process. Besides, efficient training of DL models re- 
quires access to strong computing resources (e.g., a GPU 
server), which may be unaffordable to researchers and prac- 
titioners with a limited budget. Most traditional ML mod- 
els, on the other hand, can be easily trained on a laptop. 
Thirdly, the feature engineering required by traditional ML 
models plays an important role in contributing to a theoret- 
ical understanding of constructs that are not only useful for 
classification of forum posts, but are also informative about 
students’ discussion behaviors, offering instructors insights 
on whether their instructional approach works as expected 
[45, 12, 58]. 


To help researchers and educators select relevant models for
post classification, this study aims at providing a systematic 
evaluation of the mainstream ML and DL approaches com- 
monly used to classify educational forum posts. Throughout 
this evaluation, we advance research in the field by ensur- 
ing that: (i) sufficient effort is allocated to design as many 
meaningful features as possible to empower traditional ML 
models; (ii) an adequate number of representative ML and 
DL models is included; (iii) the effectiveness of selected mod- 
els is examined by using more than one dataset, thus adding 
to the robustness of our approach to different educational 
contexts; (iv) all models are compared in the same experimental
setting, e.g., with the same training/test data splits,
and performance reported on widely used evaluation metrics
to provide common ground for model comparison; and
(v) the coding schemes used for labeling discussion posts are
made publicly available to motivate the replication of our 
study. Formally, the evaluation was guided by the following 
two Research Questions: 


RQ1 To what extent can traditional ML models accu- 
rately classify educational forum posts? 


RQ2 What is the performance difference between tradi- 
tional ML models and DL models in classifying ed- 
ucational forum posts? 


To answer the RQs, we chose two human-annotated datasets 
collected at two educational institutions: Stanford Univer- 
sity and Monash University. We further conducted the 
evaluation as per the following two classification tasks: (i) 
whether a post requires an urgent response or not; and (ii) 
whether the post content is related to knowledge and skills 
taught in a course. Specifically, to answer RQ1, we first 
surveyed relevant studies that reported on applying tradi- 
tional ML models to classify educational forum posts. We 
hence selected four models that were commonly utilized, i.e.,
Logistic Regression, Naive Bayes, SVM, and Random Forest.
In particular, we collected features frequently employed
in the reviewed studies and incorporated them as an input
to empower the four traditional ML models in our experi- 
ment. Given that these features may play different roles in 
different classification tasks, we further conducted a feature 
selection analysis to shed light on the features that must be 
included in the future application of these models for similar 
classification tasks. 


To answer RQ2, we selected two widely adopted DL
models, Convolutional Neural Network coupled with Long 
Short-Term Memory (CNN-LSTM) and Bi-directional LSTM 
(Bi-LSTM), and compared them to the four selected tradi- 
tional ML models. Recent studies in DL suggested that the 
performance of a model adopted for solving an NLP task 
(CNN-LSTM or Bi-LSTM in our case, denoted as the task 
model for simplicity) can be greatly improved with the aid 
of state-of-the-art pre-trained language models like BERT 
[16] in two ways. Firstly, BERT can be used to transform 
the raw text of a post into a set of semantically accurate 
vector-based representations (i.e., word embeddings), which
comprise the input information for the task model and en- 
able the model to distinguish among multiple characteristics 
of a post. Secondly, BERT can adapt itself to capture the 
unique data characteristics of the task at hand. To this end, 
BERT couples with the task model and learns the model 
parameters. In particular, such flexibility has been demon- 
strated as extremely helpful in the contexts where training 
data was not sufficient. Therefore, we explored the effective- 
ness of BERT in empowering the two DL models selected for 
the experiment. We provide details in Section 3. 


The performance of the four traditional ML and two DL models
was examined using four evaluation metrics commonly used in
classification tasks, i.e., Accuracy, Cohen’s κ, Area Under
the ROC Curve (AUC), and F1 score. In summary, this 
study contributed to the literature of the classification of 
educational forum posts with the following main findings: 


• Compared to other traditional ML models, Random
Forest is more robust in classifying educational forum
posts;


• Both textual and metadata features should be engineered
to empower traditional ML models;


• Different features should be designed when applying
traditional ML models for different classification tasks;


• DL models tend to outperform traditional ML models
and the performance difference ranges from 1.85% to
5.32% with respect to different evaluation metrics;


• Using the pre-trained language model BERT benefits
the performance of DL models.


2. RELATED WORK 


2.1 Content Analysis of Forum Posts 

Across disciplines, educators widely utilize online discussion 
forums to accomplish different instructional goals. For in- 
stance, instructors often provide an online discussion board 
as a platform for students to ask questions and get answers 
about course content [12, 57], argue for/against a particular
issue and, in that way, engage deeply with course topics
[43, 42], or work collaboratively on a course project [49, 13].
In this process, instructors monitor student involvement by 
reading their posts. At the same time, instructors judge 
student contributions in the discussion task, e.g., whether 
students asked a question that relates to course content vs. 
a question about semester tuition; described their feelings 
about the discussed problem vs. just rephrased the prob- 
lem; or clearly communicated their ideas to classmates in 
a collaborative learning task. Upon identifying posts that 
do not contribute to the forum at the expected level, the 
instructor may intervene accordingly. Sometimes, such an 
intervention needs to be provided immediately (e.g., in the
case of a post pointing out an error in the practice exam
key).


With the increasing popularity of online discussion forums in 
the instructional context, educational researchers have be- 
come interested in conducting content analysis of students’ 
posts to find evidence and extent of learning processes that 
instructors aimed to elicit in online discussion. To this end, 
researchers utilize a coding scheme, a predefined protocol that
categorizes and describes participants’ behaviors represen- 
tative of the observed educational construct [47, 37], e.g., 
knowledge building [23, 35], critical thinking [39, 35], argu- 
mentative knowledge construction [55], interaction [26, 43], 
social cues, cognitive/meta-cognitive skills and knowledge, 
depth of cognitive processing [26, 25], and self-regulated 
learning in collaborative learning settings [50]. As per the 
analytical procedure, researchers read student postings and 
apply a code over a unit of analysis that can be determined 
physically (e.g., entire post), syntactically (e.g., paragraph, 
sentence) or semantically (e.g., meaningful unit of text) [15, 
47]. Content analysis clearly demonstrated a potential to 
capture relevant, fine-grained discussion behaviors and pro- 
vide researchers and educators with warranted inferences 
made from coding data [46, 47, 28]. 


Manual content analysis is time-consuming [25], especially in 
high-enrollment courses with thousands of discussion posts 
that students create. To automate the process of content 
analysis and support monitoring of student discussion activ- 
ity, various computational approaches have been developed 
for post classification. These approaches relied upon tradi- 
tional ML models and DL models and handled four common 
types of post classification tasks: content, confusion, senti- 
ment, and urgency. Below, we expand upon the studies that 
reported on these tasks. 


2.2 Traditional Machine Learning Models 

Educational researchers have applied traditional ML mod- 
els to automate content analysis of online discussion posts 
for different instructional needs. The ML models we identified
in this review are predominantly based on the supervised
learning paradigm and can be categorized into four general
methodological approaches: regression-based (e.g., Logistic
Regression [1, 61, 57, 2, 62, 36]), Bayes-based (e.g., Naive
Bayes [5, 4, 36]), kernel-based (e.g., SVM [45, 12, 5, 18,
36, 58, 40, 30]), and tree-based (e.g., Random Forest [5, 
2, 36, 31, 19, 38]). These models were designed to predict 
an outcome variable that represented the meaning of dis- 
cussion posts across different categories such as confusion, 
sentiment or urgency. For instance, [12] created an SVM 
classifier to differentiate between content-related and
non-content-related questions in a discussion thread to help
instructors more easily detect content-related discourse across
an extensive number of student posts in a MOOC, while [1]
implemented a Logistic Regression classifier to detect confu- 
sion in students’ posts and automatically recommend task- 
relevant learning resources to students who need it. [58] 
applied SVM to detect student achievement emotions [41] 
in MOOC forums and studied the effects of those emotions 
on student course engagement. 


In recent years, researchers became increasingly interested 
in analyzing the expression of urgency (e.g., regarding course 
content, organization, policy) in a discussion post [2]. For 
example, [2] developed multiple ML classifiers to identify 
posts that need prompt attention from course instructors. 
While researchers mostly implemented supervised ML mod- 
els, here we also note a small group of studies that reported 
on using unsupervised methods to classify forum posts, e.g., 
a lexicographical database of sentiments [36] and minimizing 
entropy [8]. 


Traditional ML models were built upon textual and non-textual
features extracted from students’ posts. Textual features 
characterize content of the discussion post, e.g., presence of 
domain specific words [61, 44], presence of words reflective 
of psychological processes [31, 38, 19], term frequency [2, 5], 
emotional and cognitive tone [12, 57, 40, 58, 7, 34, 1], pres- 
ence of predefined hashtags [21], text readability index [62], 
text cohesion metrics [31, 38, 19], and measures of similar- 
ity between message text [31, 38, 19]. Non-textual features, 
on the other hand, include post metadata, e.g., popularity
(views, votes, and responses) [12, 45, 36], number of unique
social network users [45, 1, 18], timestamp [45, 36], type 
(post vs. response) [36], variable that signals whether the 
issue has been resolved or not [61], the relative position of 
the most similar post [51], variable that signals whether the 
author of the post is also the initiator of the thread [51], 
page rank value of the author of current post [51], indicator 
if a message is the first or last in a thread [38, 31, 19], and 
structure of the discussion thread [53, 31, 19, 38]. 


Researchers computed a variety of evaluation metrics to as- 
sess performance of these models. Classification accuracy 
was commonly applied in studies that we reviewed (e.g., [12, 
61, 36]). Generally, models achieved classification accuracy 
of 70% to 90% in classifying forum posts across different 
levels of content identification, urgency, confusion, and sen- 
timent. We also note that some authors opted for different 
or additional evaluation metrics, e.g., precision/recall [12, 1, 
62], AUC [12, 18], F1 [1, 2], kappa [1, 61, 57]. Across the 
models, authors utilized a wide range of different validation 
strategies (e.g., cross validation, train/test split). 
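To make the metrics named above concrete, the following is a minimal sketch of how Accuracy, F1, Cohen's kappa, and AUC can be computed for a binary classifier. The function names and the toy labels and scores are illustrative assumptions and are not taken from any of the reviewed studies.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(y_true, y_pred):
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    if p_expected == 1:
        return 0.0
    return (p_observed - p_expected) / (1 - p_expected)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels, predictions, and scores).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

print(round(accuracy(y_true, y_pred), 3))     # 0.75
print(round(f1_score(y_true, y_pred), 3))     # 0.667
print(round(cohen_kappa(y_true, y_pred), 3))  # 0.467
print(round(auc(y_true, scores), 3))          # 0.933
```

On imbalanced data, accuracy alone can look favourable even when kappa is modest, as in this toy example. This is one reason why studies often report several metrics side by side.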


We identified two major challenges researchers should be 
aware of when using traditional machine learning approaches 
to detect relevant content, confusion, sentiment and/or ur- 
gency in a discussion forum. First, traditional machine 
learning approaches usually involve extensive feature engi- 
neering. In the context of post classification, a huge num- 
ber of textual and non-textual features of a post is practi- 
cally available to researchers. Features can be generated us- 
ing different text mining approaches (e.g., dictionary-based, 
rule-based) and can even be produced using other classifiers
(e.g., [1]). Researchers thus often face a challenge to
decide which feature subset to choose to best capture educa- 
tional problems (e.g., off-topic posting, misinterpreting the 
discussion task, unproductive interaction with peers) and/or 
learning process of interest (e.g., knowledge building, criti- 
cal thinking, argumentation). For this reason, domain and 
learning experts, including course instructors, learning sci- 
entists, and educational psychologists are often needed to 
define a feature space that aligns with the purpose of an 
online discussion. Second, works in [12, 57, 4, 2] took a va- 
riety of different approaches to validate the classifiers they 
developed in terms of metrics, datasets, and training parameters,
which makes it nearly impossible to directly compare the
performance of these ML models. 


2.3 Deep Learning Approaches 

To our knowledge, relatively few studies have attempted to
explore the effectiveness of DL approaches in classifying
educational forum posts [54, 59, 10, 24, 8, 3, 6]. The DL models
adopted by these studies, typically, relied on the use of CNN, 
LSTM, or a combination of them. For instance, [54] devel- 
oped a DL model called ConvL, which first used CNN to 
capture the contextual features that are important to discern 
the type of a post, and then applied LSTM to further utilize 
the sequential relationships between these features to assign 
a label to the post. Through extensive experiments, ConvL 
was demonstrated to achieve about 81%~87% Accuracy in
classifying discussion posts of different levels of urgency, con- 
fusion, and sentiments. In a similar vein, [59] proposed to 
use Bi-LSTM to better make use of the sequential relation- 
ships between different terms contained in a post (i.e., from 
both of the forward and backward directions). By compar- 
ing with SVM and a few DL models, this study showed that 
Bi-LSTM performed the best in determining whether a post 
contained a question or not (72%~75% Accuracy). 


It is worth noting that the success of DL models often depends
on the availability of a large amount of human-annotated
data for model training (typically at least tens of thousands of examples).
This, undoubtedly, limits the applicability of DL models in 
tackling tasks with only a small amount of training data 
(e.g., a few thousand). Fortunately, with the aid of pre- 
trained language models like BERT [16], we can still exploit 
the power of DL models [10]. Pre-trained language mod- 
els aim to produce semantically meaningful vector-based 
representations of different words (i.e., word embeddings) 
by training on a large collection of corpora. For instance, 
BERT was trained on English Wikipedia articles and Book 
Corpus, which contain about 2,500 million and 800 million 
words, respectively. Two distinct benefits were brought by 
such pre-trained language models: (i) the word embeddings 
produced by them encode rich contextual and semantic
information of the text and can be well utilized by a task 
model (e.g., ConvL described above) to distinguish different 
types of input data; and (ii) a pre-trained language model 
can be adapted to a specific task by concatenating itself to 
the task model and further fine-tuning/learning their param- 
eters as a whole with a small amount of training data. For 
example, [10] showed that BERT was able to boost classifi- 
cation Accuracy up to 83%~92% when distinguishing posts 
of different levels of confusion, sentiment, and urgency. 


Table 1: The features used as input for traditional ML models. The features used to train models are denoted as Yes under the column Included.

Category | Feature | Description | # features | Studies used this feature | Included
Textual | # unigrams and bigrams | Only the top 1000 most frequent unigrams/bigrams are included. | 2000 | [12, 57, 40, 2, 56, 62, 51] | Yes
Textual | Post length | # words contained in a post. | 1 | [58, 45, 36, 62] | Yes
Textual | TF-IDF | The term frequency-inverse document frequency (TF-IDF) of the top 1000 most frequent unigrams. | (illegible) | (illegible) | Yes
Textual | Text readability index | A score ∈ [0, 100] specifying the post readability. | 1 | [62] | Yes
Textual | LIWC | A set of features denoted as scores ∈ [0, 100] indicating the characteristics of a post from various textual categories including: language summary, affect, function words, relativity, cognitive process, time orientation, punctuation, personal concerns, perceptual process, grammar, social and drives. | 84 | [2, 58, 7, 34, 31, 38, 19] | Yes
Textual | Words overlap | The fraction of words that appeared previously in the same post thread. | (illegible) | (illegible) | Yes
Textual | # domain-specific words | Words selected by expert to characterize a specific subject, e.g., “equation” and “formula” for Math. | - | [61, 44] | No
Textual | LDA-identified words | Words that are specific to topics discovered by applying the topic modeling method Latent Dirichlet allocation. | - | [62, 4, 44] | No
Textual | Coh-Metrix | A set of features indicating text coherence (i.e., co-reference, referential, causal, spatial, temporal, and structural cohesion), linguistic complexity, text readability, and lexical category. | - | [31, 38] | No
Textual | LSA similarity | A score indicating the average sentence similarity within a message. | - | [31] | No
Textual | Hashtags | Hashtags pre-defined by instructors to characterize the type of a post, e.g., #help and #question for confusion detection. | - | [21] | No
Metadata | # views | The number of views that a post received. | 1 | [45, 12, 36, 62, 18] | Yes
Metadata | Anonymous post | A binary label to indicate whether a post is anonymous to other students. | 1 | [45, 1, 18] | Yes
Metadata | Creation time | The day and the time when a post was made. | 2 | [45, 36, 18] | Yes
Metadata | # votes | The number of votes that a post received. | 1 | [45, 12, 36, 62, 18] | Yes
Metadata | Post type | A binary label to indicate whether a post is a response to another post. | 1 | [36, 2] | Yes
Metadata | Response time | The amount of time before a post was responded. | - | [18] | No
Metadata | # responses | The number of responses that a post received. | - | [45, 12, 36, 18] | No
Metadata | Discussion status | A binary label to indicate whether the issue has been resolved or not. | - | [61] | No
Metadata | Comment Depth | A number assigned to a post to indicate its chronological position within a discussion thread. | - | [53] | No
Metadata | First and last post | A binary label to indicate whether the post is the first or the last in a discussion thread respectively. | - | (illegible) | No


Though gaining some impressive progress, the studies described
above were often limited in providing a systematic
comparison between the proposed DL models and existing 
traditional ML models. In other words, these studies either 
did not include traditional ML models for comparison [10, 
54] or compared DL models with only one or two traditional
ML models, whose potential might have been suppressed
by the limited effort spent on feature engineering [59]. This necessitates
a systematic evaluation of the two strands of approaches so 
as to better guide researchers and practitioners in selecting
models for classifying educational forum posts.


3. METHODS 


We open this section by describing the datasets used in our 
study. Then, we introduce the representative traditional 
ML models, including the set of features we engineered to 
empower those models (RQ1), and then describe the two 
DL models we chose to compare to the four traditional ML 
models (RQ2). 

3.1 Datasets


To ensure a robust comparison between traditional ML and 
DL models in classifying educational forum posts, we adopted 
two datasets in the evaluation, briefly described below.


Stanford-Urgency consists of 29,604 forum posts collected 
from eleven online courses at Stanford University. These 
courses mainly cover subjects like medicine, education, hu- 
manities, and sciences. To our knowledge, this dataset is one 
of the few open-sourced datasets for classifying educational 
forum posts and was widely used in previous studies [57, 2, 
5, 10, 56, 24, 21]. In particular, Stanford-Urgency contains 
three types of human-annotated labels, including the degree 
of urgency of a post to be handled by an instructor, the 
degree of confusion expressed by a student in a post, and 
the sentiment polarity of a post. In line with the increasing 
research interest in detecting urgent posts [2, 10, 24, 3], we 
used Stanford-Urgency and focused on determining the lev- 
els of urgency of posts in this study. The count of urgent 
and non-urgent posts is 6,418 (22%) and 23,186 (78%), re- 
spectively. Originally, the urgency label was assigned on a 
Likert scale of [1, 7], with 1 denoting not urgent at all
and 7 denoting extremely urgent. Similar
to previous studies [2], we pre-processed the data by
treating posts with a label greater than or equal to 4 as urgent
and those with a label less than 4 as non-urgent, so that the
classification task became a binary classification problem.
It is worth pointing out two notable benefits of including 
Stanford-Urgency: (i) the large number of posts contained 
in Stanford-Urgency provided sufficient training data for DL 
models; and (ii) in addition to the text contained in a post, 
Stanford-Urgency contains rich metadata information about 
the post, e.g., the creation time of a post, whether the cre- 
ator of a post was anonymous to other students, the number 
of up-votes a post received, which enabled us to explore the 
predictive utility of different types of data. 
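The label pre-processing described above can be sketched in a few lines. This is only an illustration: the function name `binarize_urgency` and the example scores are assumptions, and the threshold of 4 is the one stated in the text.

```python
def binarize_urgency(score, threshold=4):
    """Map a Likert urgency score in [1, 7] to 1 (urgent) or 0 (non-urgent)."""
    if not 1 <= score <= 7:
        raise ValueError(f"urgency score out of range: {score}")
    return 1 if score >= threshold else 0


# Hypothetical urgency scores; the values are made up for illustration.
scores = [1, 3.5, 4, 6.5, 7]
labels = [binarize_urgency(s) for s in scores]
print(labels)  # [0, 0, 1, 1, 1]

# Share of posts labelled urgent in this toy sample.
print(sum(labels) / len(labels))  # 0.6
```

Applying the same check to the full dataset is a quick way to confirm that the resulting class proportions match the reported 22% urgent and 78% non-urgent split.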


Moodle-Content was collected at Monash University. The
dataset contains 3,703 forum posts that students generated
in the Learning Management System Moodle during their 
coursework in courses like arts, design, business, economics, 
computer science, and engineering. The posts were first 
manually labelled by a junior teaching staff member and then
independently reviewed (and corrected if necessary) by two
additional senior teaching staff members to ensure the correctness of
the assigned labels. In contrast to Stanford-Urgency, this 
dataset contains labels to indicate whether a post was re- 
lated to the knowledge and skills taught in a course or not, 
e.g., “What is poly-nominal regression?” (relevant to course 
content) vs. “When is the due date to submit the second as- 
signment?” (irrelevant). The count of content-relevant and 
content-irrelevant posts is 2,339 (63%) and 1,364 (37%), re- 
spectively. Therefore, similar to the adoption of Stanford- 
Urgency, we also tackled a binary classification problem 
here. However, it should be noted that, unlike in Stanford-Urgency,
the metadata of posts was not available in
Moodle-Content.


3.2 Traditional Machine Learning Models 

Model Selection. To ensure our evaluation is systematic, 
we included representative models that emerged in previous
studies. As summarized in Section 2.2, the traditional
ML models commonly investigated to date can be roughly
grouped into four categories, i.e., regression-based, Bayes-based,
kernel-based, and tree-based. Therefore, we selected
one model from each group and explored their capabilities
in classifying educational forum posts, namely Logistic
Regression, Naive Bayes, SVM, and Random Forest.


Feature Engineering. Different from previous studies [59, 
5], we argued that traditional ML models should involve an 
extensive set of meaningful features to fully unleash their 
predictive potential before being compared to DL models.
Specifically, we expected that ML models would demonstrate
improved performance when utilising more features. Therefore,
we surveyed studies that reported on applying traditional
ML models to classify educational forum posts, engineered
features following previous studies, and incorporated
those features into the four traditional ML models, as sum- 
marized in Table 1. These features can be classified into 
two broad categories: (i) textual features that are extracted 
from the raw text of a post with the aid of NLP techniques; 
and (ii) metadata features about a post. As the metadata 
of posts was not available in Moodle-Content, only textual 
features were engineered for this dataset, while both tex- 
tual and metadata features were engineered for Stanford- 
Urgency. We excluded several types of features from the 
evaluation, mainly due to the unavailability of the data re- 
quired to engineer those features, e.g., # domain-specific 
words, and Hashtags. As for LDA-identified words, Coh- 
Metrix, and LSA similarity, we have left these features to 
be explored in our future work. 


Feature Importance Analysis. Previous studies [12, 57, 56] 
have demonstrated the benefits of feature importance anal- 
ysis in providing a theoretical understanding of the underly- 
ing constructs that are useful to classify educational forum 
posts, e.g., identifying features that are useful across differ- 
ent classification tasks. Therefore, we adopted the following 
approach to identify the top k most important features of 
an ML model: 


1. the Chi-squared statistics between the engineered features 
and the target classification labels were computed; 


2. each time, the remaining feature with the highest 
Chi-squared statistic was fed into the model, and the feature 
was kept in the set of input features only if the 
classification performance had increased; 


3. we repeated (2) until the k most important features were 
identified. 
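
The three steps above can be sketched as a greedy forward selection. This is only an illustrative sketch, not the released implementation: the `evaluate` callback is a hypothetical stand-in for training the model on a candidate feature set and returning its score (e.g., F1), the Chi-squared formula is one common formulation for non-negative features, and starting from a baseline of negative infinity (so that the top-ranked feature is always kept) is an assumption.

```python
from typing import Callable, Dict, List, Sequence


def chi2_score(values: Sequence[float], labels: Sequence[int]) -> float:
    """Chi-squared statistic between one non-negative feature and the labels.

    Observed = per-class sum of the feature values;
    expected = class proportion * total feature mass.
    """
    total = sum(values)
    if total == 0:
        return 0.0
    n = len(labels)
    score = 0.0
    for cls in set(labels):
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        expected = total * sum(1 for y in labels if y == cls) / n
        score += (observed - expected) ** 2 / expected
    return score


def select_top_k(
    features: Dict[str, Sequence[float]],
    labels: Sequence[int],
    evaluate: Callable[[List[str]], float],  # hypothetical: returns e.g. F1
    k: int,
) -> List[str]:
    """Greedy selection following steps 1-3 of the procedure."""
    # Step 1: rank features by their Chi-squared statistic (highest first).
    ranked = sorted(
        features,
        key=lambda name: chi2_score(features[name], labels),
        reverse=True,
    )
    selected: List[str] = []
    best = float("-inf")  # assumption: the first candidate is always kept
    # Steps 2-3: try candidates in rank order until k features are kept.
    for name in ranked:
        if len(selected) == k:
            break
        score = evaluate(selected + [name])
        if score > best:  # keep the feature only if performance increased
            selected.append(name)
            best = score
    return selected
```

Note that the loop may return fewer than k features if the ranked list is exhausted before k candidates have improved the score.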


3.3 Deep Learning Models 

Existing studies on developing DL models to characterize 
different types of forum posts typically involved the use of 
CNN or LSTM, which motivated us to include the following 
two DL models in our evaluation: 


• CNN-LSTM [54, 24, 59]. This model consists of: (i) 
an input layer, which learns an embedding representation 
for each word contained in the input text; (ii) a 
CNN layer, which performs a one-dimensional convolution 
operation on the embedding representation produced 
by the input layer and captures the contextual 
information related to each word; (iii) an LSTM layer, 
which takes the output of the CNN layer to make use 
of the sequential information of the words; and (iv) a 
classification layer, which is a fully-connected layer 
taking the output of the LSTM layer as input to assign a 
label to the input text. 


• Bi-LSTM [59, 24, 6]. Though LSTMs have been demonstrated 
to be somewhat effective in utilizing the sequential 
information of long input text, they are limited to using 
only the previous words to predict the later words in 
the input text. Therefore, Bi-directional LSTM was 
proposed, which consists of two LSTM layers, one 
representing text information in the forward direction and 
the other in the backward direction, to better capture 
the sequential information between different words. 
Formally, this model consists of: (i) an input layer (same 
as CNN-LSTM); (ii) a Bi-LSTM layer; and (iii) a 
classification layer (same as CNN-LSTM). 


Both CNN-LSTM and Bi-LSTM use an input layer to learn 
the representation of the input text, i.e., embeddings of the 
words in a post. Instead of learning word embeddings dur- 
ing training, previous studies [17, 33, 32] suggested that pre- 
trained language models like BERT can be used to initialize 
embeddings. Such embedding initialization has been demon- 
strated as an effective way to facilitate a task model to ac- 
quire better performance. Therefore, we adopted BERT to 
initialize the input layer of both CNN-LSTM and Bi-LSTM 
and, correspondingly, the implemented models are denoted 
as Emb-CNN-LSTM, and Emb-Bi-LSTM, respectively. 


In addition to word embeddings initialization, as suggested 
in recent studies in the field of NLP [17, 33], we can fur- 
ther couple BERT with a task model (i.e., CNN-LSTM or 
Bi-LSTM) and adapt BERT to suit the unique character- 
istics of a task by training BERT and the task model as 
a whole. In other words, the task model is often 
concatenated on top of BERT’s output for [CLS], which is a 
special token used in BERT to encode the information of 
the whole input text. The co-training of BERT and the 
task model enables BERT to fine-tune its parameters to 
produce task-specific word embeddings for the input text, 
which further facilitates the task model to determine a suit- 
able label for the input. In fact, this fine-tuning strategy, 
compared to being used for embedding initialization, has 
been demonstrated as a more promising approach to make 
use of BERT. For instance, [10] showed that, even by 
simply coupling with a classification layer (i.e., the last 
layer of CNN-LSTM and Bi-LSTM), BERT was capable of 
accurately classifying 92% of forum posts. Most importantly, it 
should be noted that the parameters of the coupled model 
can be well fine-tuned/learned with only a few thousand 
data samples. That means, this fine-tuning strategy enables 
CNN-LSTM and Bi-LSTM to be also applicable to tasks 
that deal with only a small amount of data, e.g., Moodle- 
Content in our case. In summary, we fine-tuned BERT after 
coupling it with CNN-LSTM (CNN-LSTM-Tuned) and Bi- 
LSTM (Bi-LSTM-Tuned), respectively. Besides, to gain a 
clear understanding of the effectiveness of this fine-tuning 
strategy, we coupled BERT with only a single classifica- 
tion layer (denoted as SCL-Tuned) and compared it with 
CNN-LSTM-Tuned and Bi-LSTM-Tuned. Table 2 provides 


a summary of the DL models implemented in this study. 


Table 2: The DL models used in this study. Here, SCL 
denotes Single Classification Layer. 


Models          | Usage of BERT            | Task Model 
Emb-CNN-LSTM    | Embedding initialization | CNN-LSTM 
Emb-Bi-LSTM     | Embedding initialization | Bi-LSTM 
CNN-LSTM-Tuned  | Fine-tuning              | CNN-LSTM 
Bi-LSTM-Tuned   | Fine-tuning              | Bi-LSTM 
SCL-Tuned       | Fine-tuning              | SCL 


3.4 Experiment Setup 


Data pre-processing. Training and testing data were ran- 
domly split in the ratio of 8:2. The Python package NLTK 
was applied to perform lower casing and stemming on the 
raw text of a post after removing the stop words. 
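
As a rough illustration of the data partitioning used in the experiments (an 8:2 train/test split, with 10% of the training portion held out as validation data for the DL models), a minimal sketch might look as follows. The function name, the fixed seed, and the rounding of split sizes are assumptions made for this example only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_posts(
    posts: Sequence[T],
    test_ratio: float = 0.2,   # 8:2 train/test split
    val_ratio: float = 0.1,    # 10% of the training data for validation
    seed: int = 0,             # assumption: fixed seed for repeatability
) -> Tuple[List[T], List[T], List[T]]:
    """Randomly partition posts into (train, validation, test) lists."""
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    test, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(train_full) * val_ratio)
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test
```

Applied to the 3,703 posts of Moodle-Content, this rounding scheme would yield 741 test posts, 296 validation posts, and 2,666 training posts.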


Evaluation metrics. In line with previous works in 
classifying educational forum posts, we adopted the following 
four metrics, i.e., Accuracy, Cohen’s κ, AUC, and F1 score, to 
examine model performance. We ran each model three times 
and reported the averaged results. 
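
For reference, the four metrics can be computed for a binary task as sketched below. This is a plain-Python illustration rather than the code used in the study; the function names are hypothetical, AUC is computed through the pairwise-ranking formulation, and the verbal κ bands follow the commonly used Landis and Koch convention, which we assume here because the text interprets κ values as "substantial" or "almost perfect".

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, Cohen's kappa, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "kappa": kappa, "f1": f1}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def kappa_band(kappa: float) -> str:
    """Verbal label for kappa (assumed Landis and Koch convention)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Under this convention, a κ of 0.7892 falls in the "substantial" band and a κ of 0.8210 in the "almost perfect" band.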


Model implementation and training. The traditional ML 
models (i.e., Logistic Regression, Naive Bayes, SVM, and 
Random Forest) were implemented with the aid of the Python 
package scikit-learn and their parameters were determined 
by applying grid search on the training data. Note that all 
model hyper-parameters will be documented in the released 
GitHub repository. The ML models were trained with textual 
and metadata features for the Stanford-Urgency dataset, and 
trained with textual features for the Moodle-Content 
dataset. When applying the method detailed in Section 3.2 to 
perform feature importance analysis, we used F1 score as the 
metric to measure the change in model performance. For both 
CNN-LSTM and Bi-LSTM, the model parameters were selected to 
be comparable with similar previous works [54, 24, 59, 10]. 
To this end, the size of the BERT embeddings used in the 
input layer was 768 and the number of hidden units used in 
the final classification layer was 1. We used the sigmoid 
activation function and an L2 regularizer. In CNN-LSTM, the 
CNN layer was set to have 128 convolution filters with a 
filter width of 5, while the LSTM layer was set to have 128 
hidden states and 128 cell states. In Bi-LSTM, the numbers 
of hidden states and cell states in the LSTM cells were both 
set to 128. For all DL models, (i) 10% of the training data 
was randomly selected as the validation data; (ii) the batch 
size was set to 32 and the maximum length of the input text 
was set to 512; (iii) the optimization algorithm Adam was 
used; (iv) the learning rate was set by applying the one 
cycle policy with a maximum learning rate of 2e-05; (v) the 
dropout probability was set to 0.5; and (vi) the maximum 
number of training epochs was 50, early stopping was applied 
when the model performance on the validation data started to 
decrease, and data shuffling was performed at the end of 
each epoch. The best model was selected based on validation 
error. For BERT, we used the service provided by 
Bert-as-service¹. 


¹https://github.com/hanxiao/bert-as-service 
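
To make setting (iv) concrete, the sketch below shows one possible shape of a one cycle learning-rate schedule that peaks at 2e-05. Only the maximum learning rate comes from the setup above; the warm-up fraction, the starting and final divisors, and the cosine decay are assumptions chosen for illustration, and the actual schedule used in the experiments may differ.

```python
import math


def one_cycle_lr(
    step: int,
    total_steps: int,
    max_lr: float = 2e-5,             # maximum learning rate from the setup
    pct_start: float = 0.3,           # assumption: 30% of steps for warm-up
    div_factor: float = 25.0,         # assumption: initial_lr = max_lr / 25
    final_div_factor: float = 1e4,    # assumption: min_lr = initial_lr / 1e4
) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))
    if step < warmup_steps:
        # Rise linearly from initial_lr towards max_lr.
        frac = step / warmup_steps
        return initial_lr + frac * (max_lr - initial_lr)
    # Decay from max_lr to min_lr along half a cosine wave.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * frac))
```

In a training loop, the value returned for the current step would be assigned to the optimizer before each parameter update.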
Table 3: The performance of traditional ML models. The results in bold represent the best performance in each task. 


                    | Stanford-Urgency                     | Moodle-Content 
Methods             | Accuracy  Cohen’s κ  AUC     F1      | Accuracy  Cohen’s κ  AUC     F1 
Naive Bayes         | 0.7536    0.5071     0.7762  0.7844  | 0.7183    0.4736     0.7210  0.6870 
SVM                 | 0.8627    0.7347     0.8630  0.8185  | 0.7536    0.5900     0.7536  0.7530 
Random Forest       | 0.8915    0.7892     0.8916  0.8918  | 0.7544    0.5927     0.7551  0.7661 
Logistic Regression | 0.8068    0.6287     0.8068  0.7638  | 0.7339    0.5251     0.7357  0.7547 


Table 4: The performance of Random Forest on Stanford-Urgency when using different types of features as input. The results 
in bold represent the best performance. 


Types of Features  | Accuracy  Cohen’s κ  AUC     F1 
Textual            | 0.8639    0.7368     0.8642  0.8652 
Metadata           | 0.8150    0.6442     0.8152  0.8136 
Textual + Metadata | 0.8915    0.7892     0.8916  0.8918 


Table 5: The performance of Random Forest when only using the top-10 most important features (Table 6) as input. The 
fractions within brackets indicate the decreased performance compared to those with all available features as input (Table 3). 


                 | Accuracy         | Cohen’s κ        | AUC              | F1 
Stanford-Urgency | 0.8610 (-3.42%)  | 0.7315 (-7.31%)  | 0.8617 (-3.35%)  | 0.8628 (-3.25%) 
Moodle-Content   | 0.7175 (-4.89%)  | 0.5577 (-5.91%)  | 0.7186 (-4.83%)  | 0.7358 (-3.96%) 


4. RESULTS 


Results on RQ1. The performance of the four traditional ML 
models is presented in Table 3. Across both classification 
tasks, Random Forest achieved the best performance on every 
evaluation metric, followed by SVM and Logistic Regression. 
Naive Bayes, on the other hand, achieved the lowest 
performance on most metrics. Specifically, Random Forest was 
capable of accurately classifying almost 90% of the forum 
posts in Stanford-Urgency, and reached an AUC and F1 score of 
0.8916 and 0.8918, respectively. Besides, the Cohen’s κ score 
achieved by Random Forest for the same dataset was 0.7892, 
which indicates a substantial (and close to almost perfect) 
classification performance. In terms of classifying 
Moodle-Content, we noticed the overall performance of all 
models was lower than in Stanford-Urgency. This may be 
attributed to the lack of metadata features and significantly 
fewer posts in Moodle-Content than in Stanford-Urgency, 
making it harder for the models to reveal characteristics of 
different types of posts in Moodle-Content. Still, Random 
Forest achieved an overall accuracy, AUC, and F1 score of 
0.7544, 0.7551, and 0.7661, respectively, and its Cohen’s κ 
score was very close to 0.6, which indicates an almost 
substantial classification performance. 


Before delving into the identification of the most predictive 
features, we submitted each group of the textual and 
metadata features to the best-performing ML model (i.e., 
Random Forest) to depict their overall predictive power. The 
results are given in Table 4, derived only from 
Stanford-Urgency due to the unavailability of the metadata 
features in Moodle-Content. We observe that both textual and 
metadata features were useful in boosting classification 
performance, and textual features seem to have had a stronger 
capacity in distinguishing urgent from non-urgent posts. For 
instance, when only taking textual features into 
consideration, the AUC score was 0.8642, which is about 6% 
higher than that of metadata features (0.8152) and only 
about 3% lower than that obtained when considering both 
textual and metadata features (0.8916). 


To gain a deeper understanding of the predictive power of 
different features, we further applied the method described 
in Section 3.2 to select the top 10 most important features 
in both Stanford-Urgency and Moodle-Content, described in 
Table 6. Here, several interesting observations can be made. 


Firstly, almost all of the identified features were textual fea- 
tures, with only one exception observed in Stanford-Urgency, 
i.e., the metadata feature # views. This is in line with the 
findings we observed in Table 4, i.e., compared to meta- 
data features, textual features tended to make a larger con- 
tribution in classifying forum posts. Among those textual 
features, we should also notice that most of them were 
extracted with the aid of LIWC. This corroborates the 
findings presented in previous studies [31, 38, 19], i.e., LIWC 
is a useful tool in identifying meaningful features for char- 
acterizing educational forum posts. 


Secondly, there is little overlap regarding the top ten most 
important features in the two tasks (only two shared 
features, i.e., LIWC: pronoun and LIWC: posemo). In 
particular, we note that the top features were highly related 
to the context of a classification task. In the 
Stanford-Urgency case, a number of top features were 
associated with a sense of stimulation (e.g., anxiety, affect, 
drive), which represents a subjective representation of 
urgency. In the Moodle-Content case, features were more 
associated with a sense of investigation (e.g., Analytic and 
Understand). This shows that different classification tasks 
(i.e., Urgency vs. Content-related) require task-specific 
features to best capture the task-specific information (i.e., 
whether the post expressed a sense of urgency). 


Table 6: The top 10 most important features used in Random Forest. Features shared by the two tasks are in bold. 


Stanford-Urgency 

Features           | Description 
Metadata: # views  | The number of views that a post received. 
LIWC: pronoun      | # of the occurrence of all pronouns (e.g., personal and impersonal pronouns) 
Unigram: they      | # of the occurrence of the word “they” 
LIWC: number       | # of the occurrence of the digital numbers 
LIWC: affect       | A score indicating the overall emotion (positive and negative) of a post 
LIWC: posemo       | A score indicating the positive emotion of a post 
LIWC: drives       | A score indicating the needs, motives, and [illegible] of a post (e.g., references to success and failure) 
LIWC: power        | A score indicating the power of a post (e.g., reference to dominance) 
LIWC: anx          | A score indicating the anxiety conveyed in a post 
LIWC: QMark        | # of the occurrence of question mark 


Moodle-Content 

Features           | Description 
Post length        | # words contained in a post. 
LIWC: Analytic     | A score indicating the formal, logical, and hierarchical thinking patterns in a post 
LIWC: Tone         | A score indicating the emotional tone conveyed in a post 
LIWC: pronoun      | # of the occurrence of all pronouns (e.g., personal and impersonal pronouns) 
LIWC: ppron        | # of the occurrence of all personal pronoun (e.g., he, she, me) in a post 
Unigram: I         | # of the occurrence of the word “I” 
LIWC: posemo       | A score indicating the positive emotion of a post 
TF-IDF: understand | The TF-IDF score of the word “understand” 
LIWC: affiliation  | A score indicating the [illegible] harmonious relationships conveyed in a post 
LIWC: Exclam       | # of the occurrence of exclamation mark 


Table 7: The performance of DL models. The results in bold represent the best performance in each task. The fractions 
within brackets indicate the increased performance compared to the best performance achieved by Random Forest (Table 3). 


Dataset          | Models            | Accuracy        | Cohen’s κ       | AUC             | F1 
Stanford-Urgency | 1. Emb-CNN-LSTM   | 0.9203 (3.23%)  | 0.8192 (3.80%)  | 0.9201 (3.20%)  | 0.9203 (3.20%) 
Stanford-Urgency | 2. Emb-Bi-LSTM    | 0.9159 (2.73%)  | 0.8051 (2.01%)  | 0.9153 (2.66%)  | 0.9159 (2.71%) 
Stanford-Urgency | 3. CNN-LSTM-Tuned | 0.9211 (3.32%)  | 0.8210 (4.02%)  | 0.9221 (3.42%)  | 0.9221 (3.40%) 
Stanford-Urgency | 4. Bi-LSTM-Tuned  | 0.9210 (3.30%)  | 0.8196 (3.85%)  | 0.9208 (3.28%)  | 0.9210 (3.27%) 
Stanford-Urgency | 5. SCL-Tuned      | 0.9210 (3.31%)  | 0.8206 (3.98%)  | 0.9215 (3.35%)  | 0.9219 (3.38%) 
Moodle-Content   | 6. CNN-LSTM-Tuned | 0.7934 (5.17%)  | 0.6230 (5.11%)  | 0.7952 (5.32%)  | 0.7993 (4.33%) 
Moodle-Content   | 7. Bi-LSTM-Tuned  | 0.7854 (4.11%)  | 0.6220 (4.93%)  | 0.7901 (4.64%)  | 0.7913 (3.29%) 
Moodle-Content   | 8. SCL-Tuned      | 0.7716 (2.29%)  | 0.6092 (2.77%)  | 0.7733 (2.42%)  | 0.7803 (1.85%) 


Moreover, when solely using the top 10 features as input, 
the performance of Random Forest was 3.25% to 7.31% lower 
than the performance obtained after incorporating all 
available features (Table 5). This finding confirms that, 
while the traditional ML models can achieve good 
classification performance using only the top 10 features, 
there is still potential for improvement when using more 
features. Hence, researchers should attempt to apply more 
features to fully unleash traditional ML models’ capability. 
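
The bracketed percentages in Table 5 are consistent with a relative change with respect to the all-features baseline in Table 3. The short sketch below recomputes them for Stanford-Urgency; the helper name `relative_change` is introduced here only for illustration.

```python
def relative_change(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Random Forest on Stanford-Urgency: all features (Table 3) vs. top 10 (Table 5).
all_features = {"Accuracy": 0.8915, "Cohen's kappa": 0.7892, "AUC": 0.8916, "F1": 0.8918}
top_10 = {"Accuracy": 0.8610, "Cohen's kappa": 0.7315, "AUC": 0.8617, "F1": 0.8628}

for metric, baseline in all_features.items():
    change = relative_change(top_10[metric], baseline)
    print(f"{metric}: {change:+.2f}%")
```

The same helper, applied with the Random Forest scores as the baseline, reproduces the relative improvements reported for the DL models.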


Results on RQ2. The performance of the implemented DL 
models is presented in Table 7. As Moodle-Content contained 
only 3,703 labeled posts, that was likely to be insufficient 
to support the training of CNN-LSTM or Bi-LSTM from scratch. 
Therefore, we only implemented the fine-tuned models, i.e., 
CNN-LSTM-Tuned, Bi-LSTM-Tuned, and SCL-Tuned on 
Moodle-Content. Several observations can be derived based on 
the results in Table 7. 


Firstly and unsurprisingly, DL models uniformly achieved a 
better performance than traditional ML models. This 
corroborates findings reported in [59, 54, 33, 24]. DL models 
are, therefore, superior to traditional ML models in terms 
of capturing the characteristics of a dataset and obtaining 
better classification results. However, we should note that 
the performance difference between traditional ML models 
and DL models was not that large. Specifically, on 
Stanford-Urgency, the best-performing model CNN-LSTM-Tuned 
achieved an improvement of only 3.32% in Accuracy, 4.02% in 
Cohen’s κ, 3.42% in AUC, and 3.40% in F1 score over Random 
Forest. In particular, the Cohen’s κ score was 0.8210, which 
suggests an almost perfect classification performance. 


Secondly, in contrast to the findings reported in [59], we 
found that CNN-LSTM slightly outperformed Bi-LSTM in most 
cases (i.e., Row 1 vs. Row 2, Row 3 vs. Row 4, and Row 6 vs. 
Row 7 in Table 7). Thirdly, instead of using BERT only for 
embedding initialization, the classification model achieved 
better performance when BERT was coupled with the task 
model and the coupled model was trained as a whole (i.e., 
Rows 1-2 vs. Rows 3-4 in Table 7), though the improvement was 
rather limited, e.g., less than 1% when comparing 
Emb-CNN-LSTM with CNN-LSTM-Tuned on Stanford-Urgency. 
Fourthly, we showed that in Stanford-Urgency, by simply 
coupling BERT with a single classification layer (SCL-Tuned, 
Row 5 in Table 7), the classification performance was almost 
as good as that derived by coupling BERT with more complex 
DL models like CNN-LSTM and Bi-LSTM (Rows 3-4 in Table 7). 
This implies that BERT can capture the rich semantic 
information hidden behind a post, which can be used to 
deliver adequate classification performance even by 
employing a single classification layer. 


5. DISCUSSION AND CONCLUSION 


The classification of educational forum posts has been a 
longstanding task in the research of Learning Analytics and 
Educational Data Mining. Although a number of previous 
studies have explored the applicability and effectiveness of 
traditional ML models and DL models in solving this task, a 
systematic comparison between these two types of approaches 
has not been conducted to date. Therefore, this study set 
out to provide such an evaluation, with the aim of helping 
researchers and practitioners select appropriate predictive 
models when tackling this task. Specifically, we compared 
the performance of four representative traditional ML models 
(i.e., Logistic Regression, Naive Bayes, SVM, and Random 
Forest) and two commonly-applied DL models (i.e., CNN-LSTM 
and Bi-LSTM) on two datasets. We further elaborate on 
several implications that our work may have on the 
development of classifiers for educational forum posts. We 
also list limitations to be addressed in future studies. 


Implications. Firstly, the performance difference between 
traditional ML models and DL models was not as large as 
reported by previous studies (e.g., [59]). More specifically, 
we showed that traditional ML models were inferior to DL 
models by only 1.85% to 5.32% in classification performance, 
as measured by Accuracy, Cohen’s κ, AUC, and F1 score. This 
finding implies that, when researchers and practitioners 
have no access to strong computing resources and, for this 
reason, cannot utilize DL models, they can still achieve 
acceptable classification performance by using traditional 
ML models, as long as those ML models incorporate 
carefully-crafted features. 


Secondly, our results demonstrate that the performance of 
Random Forest classifier is more robust compared to other 
traditional ML models. This implies that other more ad- 
vanced tree-based ML models (e.g., Gradient Tree Boosting 
[9]) might be worth exploring to achieve even higher clas- 
sification performance. Besides, given that the most im- 
portant feature in Stanford-Urgency was # views (Table 6) 
and the models’ performance in Moodle-Content might be 
suppressed due to the unavailability of metadata features, 
it may be worth paying special attention to acquiring and 
using metadata features when applying traditional ML mod- 
els. Another finding suggests that little overlap was detected 
between the top 10 most important features selected in each 
of the two classification tasks (Table 6). This implies that, 
when tackling a classification task, features should be 
designed to suit the unique characteristics of the task and 
fit the theoretical model utilized to annotate data (e.g., 
with a predefined coding scheme). This aligns with the 
findings presented in [31, 38, 19], i.e., in different phases 
of cognitive presence, different importance scores were 
obtained for the same features. Lastly, 
researchers and practitioners may wish to take advantage 
of pre-trained language models like BERT when develop- 
ing DL models. Our experiment showed that BERT can be 


effectively used in two ways, i.e., (i) to initialize the word 
embeddings of the post text as the input for a task model; 
or (ii) to suit the needs of the specific classification task 
by coupling itself with the task model and then fine-tuning 
model parameters. In particular, the second way enables DL 
models to be applicable to tasks that deal with only a small 
amount of human-annotated data, as in Moodle-Content. 


Limitations. Firstly, the evaluation presented in this study 
focused on only two classification tasks, i.e., Stanford-Urgency 
and Moodle-Content. To further increase the reliability of 
the presented findings, more tasks should be included and 
investigated, e.g., determining the level of confusion that 
a student expressed in a forum post or whether the senti- 
ment contained in the post is positive or negative [10, 54]. 
Secondly, a few types of features were not included when 
exploring the capabilities of traditional ML models in our 
evaluation, e.g., # domain-specific words and LDA-identified 
words. To accurately depict the upper bound of the 
performance of traditional ML models in classifying 
educational forum posts, it would be worthwhile to recruit 
domain experts to further engineer and make use of these 
features. Thirdly, we should notice that the DL models used 
in our evaluation (i.e., CNN-LSTM and Bi-LSTM) only utilized 
the raw text of a post as input and left the metadata 
features untapped. Given that metadata features have been 
demonstrated to be of great importance in the application of 
traditional ML models, future research efforts should also 
be allocated to design 
more advanced DL models that are capable of using both 
the raw text of a post and the metadata of the post for 
classification. 


Lastly, we acknowledge that, due to the scope of this study, 
we did not attempt to investigate the reasons causing the 
performance difference between traditional ML models and 
DL models, e.g., whether the two categories of models mis- 
classified the same types of messages. In the future, we 
will further investigate whether the performance difference 
between traditional ML models and DL models can be at- 
tributed to their model structures and explore potential meth- 
ods to boost their classification performance, e.g., collecting 
additional forum posts to continue the pre-training of BERT 
before coupling it with a downstream classification model. 